Whilst base R plots are quick and useful for examining our data, they don’t always offer the flexibility and attractive customization options that we’d like for a presentation or manuscript. This is where ggplot2 comes in.

This session will teach you the basics of using ggplot2 to visualize data in R. ggplot2 was developed in 2005 by Hadley Wickham as an open source data visualization package for R. With ggplot2, you can create anything from simple scatter plots to complex figures that are (almost) completely customizable.

Resources:

https://towardsdatascience.com/guide-to-data-visualization-with-ggplot2-in-a-hour-634c7e3bc9dd

https://ggplot2.tidyverse.org/reference/

https://www.youtube.com/@Riffomonas


Install necessary packages

ggplot2 is included in the tidyverse package, so installing and loading tidyverse covers almost everything in this session. We also install ggtext, which we’ll use later to format axis text.

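pkgs <- c("tidyverse", "ggtext")
missingPkgs <- pkgs[!sapply(pkgs, requireNamespace, quietly = TRUE)] # requireNamespace() checks one package at a time
if (length(missingPkgs) > 0) install.packages(missingPkgs) # install only what is missing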

library(tidyverse)
library(ggtext)

Understanding ggplot syntax

There are three fundamental elements that go into constructing a plot with ggplot2:

Data - dataframe to be plotted data = dataframe

Aesthetics - maps variables to elements of the plot (e.g. x axis, y axis, color) mapping = aes()

Geometry/Layers - visual elements used for the data + geom_function()

The typical input code for ggplot will usually look something like this:

ggplot(data = df, # input data
       mapping = aes(x = var1, # input mapping aesthetics
                     y = var2,
                     color = var3)) +
  geom_point() # add plotting layer

Setting up data for success

One of the most important parts of getting ready to plot data in R is ensuring that your data are “tidy”. When passing instructions to ggplot2, the program interprets dataframes in a fixed way:

columns are variables

rows are observations

Let’s examine a dataframe to better understand how ggplot2 interprets data. Here, we will be using the Iris dataset, which is built into R. The measurements were collected by botanist Edgar Anderson and made famous by British statistician and biologist Ronald A. Fisher, who used them in a 1936 paper. The dataset describes the variation in iris flowers of three different species: Iris setosa, Iris versicolor, and Iris virginica.

head(iris) # view the first six rows of the dataframe

Looking at the first rows of this dataframe we can see that each variable is contained in a column and each row is an observation. This means that if you have replicate measurements (as in this dataset, there are multiple measurements of each variable per species) you will need to have a row per replicate rather than storing the replicate data in columns.
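If your own data store replicates in separate columns, they need reshaping before plotting. The sketch below is one way to do that with pivot_longer() from tidyr (part of the tidyverse). The table `wide` and its column names are made up for illustration and are not part of this session’s data.

```r
library(tidyr) # part of the tidyverse

# Hypothetical "wide" table: one row per species, one column per replicate
wide <- data.frame(species = c("A", "B"),
                   rep1 = c(1.2, 3.4),
                   rep2 = c(1.5, 3.1),
                   rep3 = c(1.1, 3.8))

# Reshape so that every replicate measurement gets its own row
long <- pivot_longer(wide,
                     cols = c(rep1, rep2, rep3), # columns holding replicate values
                     names_to = "replicate",     # new column naming the replicate
                     values_to = "value")        # new column holding the measurement

long
```

After reshaping, `value` can be mapped to an axis and `replicate` or `species` to color, just like the columns of iris.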

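Other important notes about this dataset:

It consists of four numeric variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and one categorical variable (Species). This structured format makes it easy to map variables to aesthetics in ggplot2.

Each of the three species (setosa, versicolor, virginica) has an equal number of observations, which allows for fair visual comparisons between groups.

Column names and features contain no spaces or “-”. R doesn’t usually like these.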

Let’s build a plot!

ggplot2 builds plots in layers. You can start with a layer showing the raw data, then keep adding elements until you have the graph you want. Building up step by step makes it easier to close the gap between the plot in your head and the plot on the screen.

ggplot(data = iris, # use 'data' argument to tell ggplot which dataframe we want to plot from
       mapping = aes(x = Petal.Length, # mapping determines which variables are assigned to plot elements
                     y = Petal.Width)) -> basicPlot

basicPlot

We have told ggplot2 which variables we want to plot on the x and y axes but we have not told ggplot2 which geometric elements (i.e. geoms) to use to construct the plot, so all we have are the axes and a blank plot.

What are geoms?

Geoms are the geometric objects (e.g. lines, bars, etc.) that determine how observations are rendered. Layering elements in a plot usually starts with adding geoms. Let’s add geom_point() to our basic plot to create a scatter plot:

ggplot(data = iris,
       aes(x = Petal.Length, # it is very common to see 'mapping =' omitted from the code - ggplot will accept either
           y = Petal.Width)) + # use a + to add elements to your plot
  geom_point() -> basicScatter

basicScatter

Another way you can add layers to a plot is by simply adding them to the end of the object that we assigned our first plot to. It is very common to see this in online guides and forums (such as Stack Overflow) where you might look for help with R coding:

basicPlot +
  geom_point()

Although this generates the same output, I would generally avoid making your plots this way - if you end up with something that isn’t quite working as expected I find it can be easier to fix if all your code is laid out in front of you, rather than having to revisit each individual step in the process of making your plot.

Now we’ll start to add some more elements to our mapping aesthetics to better illustrate our data.


IMPORTANT NOTE

The initial mappings that you specify in the ggplot() call (e.g. axes, color, size) are by default used globally for the plot and are carried over to any geoms you add in the following code. Each geom can have its own separate mapping aesthetics, which allows you to create more complex plots. If you ever run into issues where a geom is not behaving as you would expect, take a look back through your code and check where your aesthetics were assigned, and how they apply to the geoms you are trying to layer.


Let’s color our points by Species:

ggplot(data=iris, 
       mapping = aes(x=Petal.Length, 
                     y=Petal.Width, 
                     color = Species)) + # tell ggplot that color is determined by Species variable
  geom_point() -> colorScatter
          
colorScatter

Now we can add a simple regression using geom_smooth() and demonstrate how changing global vs. specific aesthetics affects geoms. Some geoms have specialized arguments that control how they work. In this case, geom_smooth() lets us choose the method used to fit the curve it plots. We will opt for lm, which fits a linear model.

ggplot(data=iris, 
       mapping = aes(x=Petal.Length, 
                     y=Petal.Width, 
                     color = Species)) + 
  geom_point() +
  geom_smooth(method = lm) -> colorScatterlm
          
colorScatterlm

This creates three separate curves that map to the points from each Species by color, as this is what is specified in the global aesthetics. Let’s change that and plot a curve that spans all points:

ggplot(data=iris, 
       mapping = aes(x=Petal.Length, 
                     y=Petal.Width)) + # remove color from global aesthetics
  geom_point(aes(color = Species)) + # set geom_point aesthetics - this will only color points
  geom_smooth(method = lm) -> colorScatterlm2

colorScatterlm2

Now we can see that the points are still colored by Species, but the regression is not.

Let’s play with some more aesthetics:

ggplot(data=iris, 
       mapping = aes(x=Petal.Length, 
                     y=Petal.Width)) + 
  geom_point(aes(color = "blue")) + 
  geom_smooth(method = lm) -> basicPlotBlue

basicPlotBlue

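Notice that even though we set the color aesthetic of the points to “blue”, the points are not blue. Inside aes(), “blue” is treated as a data value (a single category named “blue”), so ggplot2 gives it the first color of its default palette and adds a legend for it. If we want to make all the points one color (or a different shape, or a different size), we set that outside aes(), because it does not depend on a variable.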

Let’s make our points blue and change their shape:

ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point(color = "blue", # note that these options are set outside of aes()
             size = 5,
             shape = 1) -> openCircleBlue

openCircleBlue


# shapes are defined by a numerical value
# available shapes can be viewed at https://www.datanovia.com/en/blog/ggplot-point-shapes-best-tips/ or by using ggpubr::show_point_shapes()

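There are other ways we can control how each layer is rendered. Let’s start with controlling “scales”. Scales allow us to edit specific elements of the aesthetics and are named in a uniform manner that describes how they act and what they affect. The names are made up of three pieces separated by “_”:

scale - every scale function starts with this word

the aesthetic being controlled (e.g. color, fill, x, y)

the type of scale (e.g. manual, continuous, discrete, gradient)

Within the scales there are several options we can edit, including:

name - the title shown on the legend or axis

values - the colors (or shapes, sizes, etc.) used by a manual scale

labels - the text shown for each legend key or axis tick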

Let’s use scale_color_manual to manually select some colors for our plot.

(There are set named colors that can be used in R https://stat.columbia.edu/~tzheng/files/Rcolor.pdf but you can also use hex codes. Make sure to use color blind friendly color schemes for figures you plan to present or publish!)

ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width,
                     color = Species)) +
  geom_point() +
  geom_smooth(method = lm) +
  scale_color_manual(name = "Iris species",
                     values = c("setosa" = "pink",
                                "versicolor" = "plum",
                                "virginica"="seagreen3"),
                     labels = c("Iris setosa",
                                "Iris versicolor",
                                "Iris virginica")) -> multiColor

multiColor

Now we have some different colors and data labels in our plot of the Iris data. The name and labels options in the scale are useful for changing how the data are labeled in your plot without needing to manipulate the raw data.

We can also use continuous color scales to visually represent changes in values:

ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point(aes(color = Petal.Length)) +
  geom_smooth(method = lm) -> blueCont

blueCont

You can also use other variables within the dataframe to control the aesthetics of the plot.

ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point(aes(color = Sepal.Length, # color is dependent on sepal length
                 size = Sepal.Width)) + # point size is dependent on sepal width
  geom_smooth(method = lm,
              color = "black", # change the color of the curve
              se = FALSE) + # hide the confidence band around the curve
  scale_color_gradient(high = "purple",
                       low = "orange") # manually set the colors of the gradient scale

Obviously there is a little too much data now contained in this plot for it to be particularly useful, but it is a good example of how much data you can display and the different ways you can present it using R.

Let’s tidy up our plot and make something that looks a little more “publication-ready”:

First, let’s assign our chosen color scale to a vector object so we can call the same colors for any future plots without needing to write out the code every time. This time I’m going to use a color blind friendly palette generated using this tool: https://davidmathlogic.com/colorblind.

plotColors <- c("setosa" = "#648FFF",
                "versicolor" = "#DC267F",
                "virginica"="#FFB000")
ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = lm,
              color = "black") +
  theme_bw() + # this is a built-in theme that removes the gray plot background
  scale_color_manual(values = plotColors) + # direct ggplot to our color vector
  ylab("Petal width (cm)") + # change y axis label - can also be done with scales
  labs(x = "Petal length (cm)",
       color = "Iris species") + # change legend title - can also be done with scales as previously
  ggtitle("Petal width by petal length per species") + # add plot title 
  theme(plot.title = element_text(hjust = 0.5)) -> multiColorTidyLm # center plot title

multiColorTidyLm

Perfect! Now we can write our plot to a pdf:

pdf("multi.color.iris.plot.lm.pdf", # file name to write to
    height = 4, # plot height in inches
    width = 6) # plot width in inches

multiColorTidyLm # tell R which plot to write to file

dev.off() # this tells R that you're done creating a file

Or we can use ggsave(), a ggplot2 function that can save a plot in many other graphics file formats:

ggsave(plot = multiColorTidyLm, # specify plot
       "multi.color.iris.plot.lm.tiff", # specify file name
       height = 4, # plot height
       width = 6, # plot width
       units = c("in"), # specify which units to use for height and width
       device = "tiff") # specify file type for saving - ggsave will also guess depending on the extension used in file name 

Plotting group means and error bars

Another way that we may want to plot our data is by plotting both group means and individual data points. This can help people better visualize the spread of our data. This is easy enough with geoms like geom_boxplot() and geom_violin() that have group metrics built into their functionality.

ggplot(data = iris,
       aes(x = Species,
           y = Petal.Length,
           fill = Species)) + 
  geom_boxplot(outliers = FALSE) +
  geom_point()

You can see that geom_boxplot() has automatically drawn a box spanning the 25th to 75th percentiles with a thick line at the median, and whiskers that extend to the furthest value no more than 1.5 × the IQR from the box. Values beyond the whiskers are normally drawn separately as outliers; here we hid them with outliers = FALSE because geom_point() already shows every observation. This is great for taking a quick summary look at your data.
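To make the whisker rule concrete, here is a minimal base R sketch that reproduces the box and whisker positions by hand for one species. It assumes the default quantile() method and a multiplier of 1.5. The object names are just for illustration.

```r
# Petal lengths for a single species
x <- iris$Petal.Length[iris$Species == "setosa"]

q <- quantile(x, c(0.25, 0.5, 0.75)) # lower box edge, median, upper box edge
iqr <- unname(q[3] - q[1])           # interquartile range (height of the box)

# Whiskers reach the most extreme values that lie within 1.5 x IQR of the box
upper <- max(x[x <= q[3] + 1.5 * iqr])
lower <- min(x[x >= q[1] - 1.5 * iqr])

# Anything beyond the whiskers would be drawn as an outlier
outliers <- x[x < lower | x > upper]

c(lower = lower, upper = upper)
```

Comparing `lower` and `upper` with the whisker ends on the plot is a quick way to confirm what the boxplot is showing.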

But what if we want to use something like geom_bar() that does not have built in group functionality?

There are a couple of ways we could solve this issue using functions in the tidyverse package. First, we could create a summary table that contains grouped information for our data using summarise.

iris %>%
  group_by(Species) %>% # group_by tells R which variable to use to group observations
  summarise(mean.Petal.Length = mean(Petal.Length), # add a column containing mean values per species
            standard.deviation = sd(Petal.Length)) -> irisSummary # add a column containing standard deviation

head(irisSummary)

We can use the summarise function to create a new dataframe that contains a mean and standard deviation for each species. We can write this to a new object and then use this for plotting by providing each geom with a different dataframe.

ggplot() + # we do not want global mapping or data for this plot so none is put in the ggplot call
  geom_col(data = irisSummary, # set the dataframe for the columns
           aes(x = Species,
               y = mean.Petal.Length,
               fill = Species),
           alpha = 0.5) +
  geom_errorbar(data = irisSummary, # set the dataframe for the error bars
                aes(x = Species,
                    ymin = (mean.Petal.Length - standard.deviation), # set the minimum error bar value
                    ymax = (mean.Petal.Length + standard.deviation)), # set the maximum error bar value
                width = 0.2) +
  geom_jitter(data = iris, # set the dataframe for the points
              aes(x = Species,
                  y = Petal.Length,
                  color = Species),
              width = 0.2, # make the total spread of the points narrower
              shape = 1) # set the shape to open circle

Now we can see both the mean and individual values on our bar plot.

Another, more streamlined, way of doing this is using stat_summary(), which removes the need to create a separate dataframe by summarising the data inside ggplot2 itself.

ggplot(data = iris,
       aes(x = Species,
           y = Petal.Length)) +
  stat_summary(geom = "col", # identify which geom we want
               fun.data = mean_se, # tell stat_summary which function to apply to summarise the data
               aes(fill = Species), # set aesthetics as normal
               alpha = 0.5) +
  stat_summary(geom = "errorbar",
               fun.data = mean_se,
               color = "black",
               width = 0.2) +
  geom_jitter(aes(color = Species),
              shape = 1,
              width = 0.2)

Voilà! We have almost the same plot as above but with a step removed. However, you may have noticed that we used the function mean_se, which calculates the mean and standard error for the vector of y values at each unique x value (i.e. the function receives a vector of Petal.Length values for each Species), whereas most of the time we like to use standard deviation. ggplot2 does not include a helper for mean ± 1 standard deviation that works without extra packages (mean_sdl() relies on the Hmisc package and defaults to 2 standard deviations) - so what do we do? Create our own.

mean.sd <- function(x){
  tibble(y = mean(x), # return a tibble (similar to a dataframe) with the mean as y
         ymin = y - sd(x), # lower end of the error bar (mean - 1 SD)
         ymax = y + sd(x)) # upper end of the error bar (mean + 1 SD)
}
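Before plotting, it can help to see how different the two error measures are. The base R sketch below compares the standard deviation and standard error for one species. The variable names are only for illustration.

```r
# Petal lengths for a single species
x <- iris$Petal.Length[iris$Species == "setosa"]

sd_x <- sd(x)                   # standard deviation: spread of the observations
se_x <- sd(x) / sqrt(length(x)) # standard error: uncertainty in the mean

c(mean = mean(x), sd = sd_x, se = se_x)
```

Because the standard error divides by the square root of the number of observations, error bars based on it are always narrower than bars based on the standard deviation when there is more than one observation per group.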

Now we can create our plot:

ggplot(data = iris,
       aes(x = Species,
           y = Petal.Length)) +
  stat_summary(geom = "col", 
               fun.data = mean.sd, 
               aes(fill = Species),
               alpha = 0.5) +
  stat_summary(geom = "errorbar",
               fun.data = mean.sd,
               color = "black",
               width = 0.2) +
  geom_jitter(aes(color = Species),
              shape = 1,
              width = 0.2)

Facets

Faceting is a technique that allows us to separate data out into panels based on a variable in the dataframe. This is useful for visualizing complex data where it may be easier to see patterns when the data are separated.

There are two functions for creating facets in a plot: facet_wrap() and facet_grid(). If you are faceting on one variable (e.g. species), facet_wrap() is usually the simplest choice; if you want to facet on two variables (e.g. species and time point) and lay the panels out as rows and columns, use facet_grid().

Let’s pull up another of R’s built-in datasets (mtcars) that will allow us to see both of these in action. mtcars is built from data extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

head(mtcars)

Let’s look at mpg (Miles per US Gallon) plotted against hp (Gross horsepower).

ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3)

Now let’s use facet_wrap() to split these data up by vs (Engine shape, 0 = V, 1 = straight).

ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3) +
  facet_wrap(~ vs)

Let’s add another variable facet with facet_grid() and split the data by am (Transmission, 0 = automatic, 1 = manual) as well.

ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3) +
  facet_grid(cols = vars(vs), # assign a variable to the column panels
             rows = vars(am)) # assign a variable to the row panels

We can see that the relationship between hp and mpg differs depending on the other qualities of the car. However, this plot is now difficult to read because both faceting variables are coded as 0 and 1, so it’s hard to tell what’s what. Let’s tidy up these plots and add some labels.

Changing the panel labels without changing the underlying data is slightly more complex than changing axis titles, so let’s look at how to do that.

vsLabs <- c("0" = "V-shaped",
            "1" = "Straight") # create a vector that matches the binary variables to their values

amLabs <- c("0" = "Automatic",
            "1" = "Manual") # do the same for the am variable

ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3) +
  facet_grid(cols = vars(vs),
             rows = vars(am),
             labeller = labeller(.cols = vsLabs, # use the labeller function to assign these labels to the rows and columns of the plot
                                 .rows = amLabs)) -> facetPlot

facetPlot

Let’s tidy the rest of this plot up and then save it to file.

facetPlot +
  scale_color_gradient(name = "Miles per\nUS Gallon", # \n starts a new line in the legend title
                       high = "purple",
                       low = "orange") + # change color of scale
  theme_bw() +
  xlab("Gross horsepower") + # add x axis title
  ylab("Miles per US Gallon") + # add y axis title
  theme(strip.background = element_rect(fill = "white")) -> facetPlotTidy # remove grey background from panel titles

facetPlotTidy

Now we can use the same methods as earlier to save our plot to either a PDF or image file (or both!).

pdf("multi.facet.mtcars.plot.pdf", # file name to write to
    height = 4, # plot height in inches
    width = 6) # plot width in inches

facetPlotTidy # tell R which plot to write to file

dev.off() # this tells R that you're done creating a file

ggsave(plot = facetPlotTidy, # specify plot
       "multi.facet.mtcars.plot.tiff", # specify file name
       height = 4, # plot height
       width = 6, # plot width
       units = c("in"), # specify which units to use for height and width
       device = "tiff") # specify file type for saving - ggsave will also guess depending on the extension used in file name

Plotting a time course

For many experiments, it’s important to be able to plot a time course. Let’s load in some example colony count data from an experiment growing four species of bacteria in both high and low iron conditions, with time points at 0, 6, and 24 hours.

read.csv("cfu_counts_raw.csv") -> counts # read in counts

Let’s take a quick look at the format of the data we just loaded and check that the format looks correct for plotting.

head(counts)

Our dataframe has columns as variables and rows as observations so we’re good to go!

In order to plot a time course as a discrete variable along the x axis, we need to change the time variable from numeric to a factor. Factors also let us control the order in which groups are plotted. By default, ggplot2 plots numeric variables in ascending order, character variables in alphabetical order, and factors in the order of their levels. So, we’ll also set the iron level as a factor because I want to plot the low iron condition before the high iron condition.

counts$time <- factor(counts$time)

counts$iron <- factor(counts$iron,
                      levels = c("Low iron","High iron"))
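To see why the levels argument matters, here is a small base R sketch comparing the default level order with an explicit one. The vector `iron` is a made-up stand-in for the iron column of the counts data.

```r
# Hypothetical stand-in for the iron column
iron <- c("Low iron", "High iron", "Low iron")

# Without 'levels', factor() sorts the unique values alphabetically
default_levels <- levels(factor(iron))

# With 'levels', we choose the order ourselves
custom_levels <- levels(factor(iron, levels = c("Low iron", "High iron")))

default_levels
custom_levels
```

ggplot2 follows the level order when arranging facets, axis categories and legend keys, so setting the levels is what places the low iron panel first.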

Now we can set our custom colors for the plot.

speciesCols <- c("Pseudomonas aeruginosa" = "#43ba8f",
                 "Staphylococcus aureus" = "#fec44f",
                 "Streptococcus sanguinis" = "#4292c6",
                 "Burkholderia orbicola" = "#d57bd4")

Let’s create a line plot of log10 CFU/mL per species over time, with facets showing the high and low iron. We will plot a ribbon that represents the standard deviation (sd), thin lines that represent each replicate (tech.rep), and a thick line representing the mean CFU/mL for each species (mean.cfu). We’ll utilize the stat_summary function that we saw earlier.

counts %>%
  mutate(log10.cfu = log10(cfu)) %>% # create a new column with log10 CFU values
  ggplot(aes(x = time,
             y = log10.cfu)) +
  stat_summary(geom = "ribbon",
               fun.data = mean.sd,
               aes(group = species, # group aesthetic specifies how the lines are joined together
                   fill = species),
               alpha = 0.5) +
  geom_line(aes(group = interaction(species,iron,tech.rep),
                color = species),
            linewidth = 0.1,
            alpha = 0.7) +
  stat_summary(geom = "line",
               fun.data = mean.sd,
               aes(color = species,
                   group = interaction(species,iron)),
               linewidth = 1) +
  scale_color_manual(name = "Species", # both color and fill must have the same name if we want to combine the legend
                     values = speciesCols) +
  scale_fill_manual(name = "Species",
                    values = speciesCols) +
  labs(x = "Time (h)", # set axis labels
       y = "Log<sub>10</sub> CFU/mL") +
  theme_bw() + # remove grey background
  facet_grid(cols = vars(iron)) + # facet plot by high/low iron
  theme(strip.background = element_rect(fill = "white", # remove grey background from facet titles
                                        color = "black"),
        legend.text = element_text(face = "italic"), # set legend font to italic
        legend.position = "inside", # move legend inside bounds of plot
        legend.position.inside = c(0.8,0.2), # use a vector to set x and y position (0 - 1)
        legend.background = element_rect(fill = "white", # set box around legend
                                         color = "black"),
        axis.title.y = element_markdown()) # allow axis title to read html code for subscript

Activities

Green 1

Create a scatter plot using the columns Sepal.Length (x) and Sepal.Width (y) from the iris dataset.

ggplot(data = iris, 
       aes(x = Sepal.Length, 
           y = Sepal.Width)) + 
  geom_point(aes(color = Species)) +
  geom_smooth(method = lm)

Green 2

Make a plot where all the points are green, and the line is colored by the species of iris.

ggplot(data = iris,
       aes(x = Petal.Length,
           y = Petal.Width)) +
  geom_point(color = "green") +
  geom_smooth(aes(color = Species),
              method = lm)

Blue 1

Make a plot that includes regression lines for individual species as well as the overall data.

ggplot(data = iris, 
       aes(x = Petal.Length, 
           y = Petal.Width)) +
  geom_point(aes(color = Species)) +
  geom_smooth(aes(color = Species), 
              method = lm) +
  geom_smooth(method = lm, 
              color="blue")

---
title: "Introduction to ggplot"
author: "Yasmin Hilliam, PhD"
date: "2025-07-05"
output: html_notebook
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      collapse = TRUE)
```


Whilst base `R` plots are quick and useful for examining our data, they don't always offer the flexibility and attractive customization options that we'd like for a presentation or manuscript. This is where `ggplot2` comes in.

This session will teach you the basics of using `ggplot2` to visualize data in R. `ggplot2` was developed in 2005 by Hadley Wickham as an open source data visualization package for R. With `ggplot2`, you can create anything from simple scatter plots to complex figures that are (almost) completely customizable.

Resources:

<https://towardsdatascience.com/guide-to-data-visualization-with-ggplot2-in-a-hour-634c7e3bc9dd>

<https://ggplot2.tidyverse.org/reference/>

<https://www.youtube.com/@Riffomonas>

------------------------------------------------------------------------

#### Install necessary packages

`ggplot2` is included in the `tidyverse` package, so installing and loading `tidyverse` covers almost everything in this session. We also install `ggtext`, which we'll use later to format axis text.

```{r, message = FALSE}
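pkgs <- c("tidyverse", "ggtext")
missingPkgs <- pkgs[!sapply(pkgs, requireNamespace, quietly = TRUE)] # requireNamespace() checks one package at a time
if (length(missingPkgs) > 0) install.packages(missingPkgs) # install only what is missing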

library(tidyverse)
library(ggtext)
```


#### Understanding ggplot syntax

There are three fundamental elements that go into constructing a plot with `ggplot2`:

**Data** - dataframe to be plotted `data = dataframe`

**Aesthetics** - maps variables to elements of the plot (e.g. x axis, y axis, color) `mapping = aes()`

**Geometry/Layers** - visual elements used for the data `+ geom_function()`

The typical input code for ggplot will usually look something like this:

```{r, eval = FALSE}
ggplot(data = df, # input data
       mapping = aes(x = var1, # input mapping aesthetics
                     y = var2,
                     color = var3)) +
  geom_point() # add plotting layer
```


#### Setting up data for success

One of the most important parts of getting ready to plot data in `R` is ensuring that your data are "tidy". When passing instructions to `ggplot2`, the program interprets dataframes in a fixed way:

**columns** are variables

**rows** are observations

Let's examine a dataframe to better understand how `ggplot2` interprets data. Here, we will be using the Iris dataset, which is built into `R`. The measurements were collected by botanist Edgar Anderson and made famous by British statistician and biologist Ronald A. Fisher, who used them in a 1936 paper. The dataset describes the variation in iris flowers of three different species: Iris setosa, Iris versicolor, and Iris virginica.

```{r, message = FALSE}
head(iris) # view the first six rows of the dataframe
```


Looking at the first rows of this dataframe we can see that each **variable** is contained in a column and each row is an **observation**. This means that if you have replicate measurements (as in this dataset, there are multiple measurements of each variable per species) you will need to have a row *per replicate* rather than storing the replicate data in columns.

Other important notes about this dataset:

-   it consists of four numeric variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and one categorical variable (Species). This structured format makes it easy to map variables to aesthetics in `ggplot2`.

-   the Iris dataset has a balanced class distribution. Each of the three species (setosa, versicolor, virginica) has an equal number of observations. This balance allows for fair visual comparisons and avoids potential biases that can arise from imbalanced datasets.

-   column names and features contain no spaces or "-". `R` doesn't usually like these.

#### Let's build a plot! 

`ggplot2` builds plots in layers. You can start with a layer showing the raw data, then keep adding elements until you have the graph you want. Building up step by step makes it easier to close the gap between the plot in your head and the plot on the screen.

```{r}
ggplot(data = iris, # use 'data' argument to tell ggplot which dataframe we want to plot from
       mapping = aes(x = Petal.Length, # mapping determines which variables are assigned to plot elements
                     y = Petal.Width)) -> basicPlot

basicPlot
```


We have told `ggplot2` which variables we want to plot on the x and y axes but we have not told `ggplot2` which geometric elements (i.e. geoms) to use to construct the plot, so all we have are the axes and a blank plot.

#### What are geoms?

Geoms are the geometric objects (e.g. lines, bars, etc.) that determine how observations are rendered. Layering elements in a plot usually starts with adding geoms. Let's add `geom_point()` to our basic plot to create a scatter plot:

```{r}
ggplot(data = iris,
       aes(x = Petal.Length, # it is very common to see 'mapping =' omitted from the code - ggplot will accept either
           y = Petal.Width)) + # use a + to add elements to your plot
  geom_point() -> basicScatter

basicScatter
```


Another way you can add layers to a plot is by simply adding them to the end of the object that we assigned our first plot to. It is very common to see this in online guides and forums (such as Stack Overflow) where you might look for help with `R` coding:

```{r}
basicPlot +
  geom_point()
```


Although this generates the same output, I would generally avoid making your plots this way - if you end up with something that isn't quite working as expected I find it can be easier to fix if all your code is laid out in front of you, rather than having to revisit each individual step in the process of making your plot.

Now we'll start to add some more elements to our mapping aesthetics to better illustrate our data.

------------------------------------------------------------------------

**IMPORTANT NOTE**

The initial mappings that you specify in the `ggplot()` call (e.g. axes, color, size) are by default used globally for the plot and are carried over to any geoms you add in the following code. Each geom *can* have its own separate mapping aesthetics, which allows you to create more complex plots. If you ever run into issues where a geom is not behaving as you would expect, take a look back through your code and check where your aesthetics were assigned, and how they apply to the geoms you are trying to layer.

------------------------------------------------------------------------

Let's color our points by Species:

```{r}
ggplot(data=iris, 
       mapping = aes(x=Petal.Length, 
                     y=Petal.Width, 
                     color = Species)) + # tell ggplot that color is determined by Species variable
  geom_point() -> colorScatter
          
colorScatter
```


Now we can add a simple regression line using `geom_smooth()` and demonstrate how global vs. geom-specific aesthetics affect geoms. Some geoms take specialized arguments that control how they work. In this case, `geom_smooth()` has a `method` argument that tells it how to fit the curve that it will plot. We will opt for `lm`, which fits a linear model.

```{r}
ggplot(data=iris, 
       mapping = aes(x=Petal.Length, 
                     y=Petal.Width, 
                     color = Species)) + 
  geom_point() +
  geom_smooth(method = lm) -> colorScatterlm
          
colorScatterlm
```


This creates three separate regression lines, one for the points from each Species, because color is mapped to Species in the global aesthetics and is therefore inherited by `geom_smooth()`. Let's change that and plot a single line that spans all points:

```{r}
ggplot(data=iris, 
       mapping = aes(x=Petal.Length, 
                     y=Petal.Width)) + # remove color from global aesthetics
  geom_point(aes(color = Species)) + # set geom_point aesthetics - this will only color points
  geom_smooth(method = lm) -> colorScatterlm2

colorScatterlm2
```


Now we can see that the points are still colored by Species, but the regression is not.

Let's play with some more aesthetics:

```{r}
ggplot(data=iris, 
       mapping = aes(x=Petal.Length, 
                     y=Petal.Width)) + 
  geom_point(aes(color = "blue")) + 
  geom_smooth(method = lm) -> basicPlotBlue

basicPlotBlue
```


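Notice that even though we set the color aesthetic of the points to "blue", the points are not blue. Anything inside `aes()` is treated as data, so ggplot2 reads "blue" as a variable with a single value, gives it the first default color, and adds a legend for it. If we want to make all the points one color (or a different shape, or a different size), these are *not* set inside `aes()`, as they are not dependent on a variable.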
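One way to see what happened is to look at the values ggplot2 computed for the layer. The sketch below uses `layer_data()` to compare a color placed inside `aes()` with one set outside it. The object names `mappedBlue` and `fixedBlue` are made up for this example.

```{r}
library(ggplot2)

# "blue" inside aes() is treated as a one-level variable, not as a color name
mappedBlue <- ggplot(data = iris,
                     mapping = aes(x = Petal.Length,
                                   y = Petal.Width)) +
  geom_point(aes(color = "blue"))

# "blue" outside aes() is passed straight to the geom as a fixed color
fixedBlue <- ggplot(data = iris,
                    mapping = aes(x = Petal.Length,
                                  y = Petal.Width)) +
  geom_point(color = "blue")

unique(layer_data(mappedBlue)$colour) # a single default palette color chosen by the scale
unique(layer_data(fixedBlue)$colour) # the literal color we asked for
```

Note that `layer_data()` reports the column as `colour`; ggplot2 accepts both spellings in your own code.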

Let's make our points blue and change their shape:

```{r}
ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point(color = "blue", # note that these options are set outside of aes() because they do not depend on a variable
             size = 5,
             shape = 1) -> openCircleBlue

openCircleBlue

# shapes are defined by a numerical value
# available shapes can be viewed at https://www.datanovia.com/en/blog/ggplot-point-shapes-best-tips/ or by using ggpubr::show_point_shapes()
```


There are other ways we can control how each layer is rendered. Let's start with controlling "scales". Scales allow us to edit specific elements of the aesthetics and are named in a uniform manner that describes how they act and what they affect. The names are made up of three pieces separated by "_":

-   `scale`
-   the name of the aesthetic (e.g. `color`, `shape`, `size`, etc.)
-   the name of the scale (e.g. `manual`, `continuous`, `discrete`, etc.)

Within the scales there are several options we can edit: 

- `name =` this controls what the variable is called within the plot or legend 
- `values =` this allows us to manually input the values (i.e. colors or shapes) used in the plot 
- `labels =` this controls the data labels (i.e. species names) in the plot or legend

Let's use `scale_color_manual` to manually select some colors for our plot.

(There are set named colors that can be used in `R` <https://stat.columbia.edu/~tzheng/files/Rcolor.pdf> but you can also use hex codes. Make sure to use color blind friendly color schemes for figures you plan to present or publish!)

```{r}
ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width,
                     color = Species)) +
  geom_point() +
  geom_smooth(method = lm) +
  scale_color_manual(name = "Iris species",
                     values = c("setosa" = "pink",
                                "versicolor" = "plum",
                                "virginica"="seagreen3"),
                     labels = c("Iris setosa",
                                "Iris versicolor",
                                "Iris virginica")) -> multiColor

multiColor
```


Now we have some different colors and data labels in our plot of the Iris data. The `name` and `labels` options in the `scale` are useful for changing how the data are labeled in your plot *without* needing to manipulate the raw data.
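The same naming pattern and options carry over to other aesthetics. As a small sketch, here is the equivalent for point shape using `scale_shape_manual()`. The shape numbers are arbitrary choices for illustration and `shapeScatter` is a made-up object name.

```{r}
library(ggplot2)

ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width,
                     shape = Species)) + # map shape (rather than color) to Species
  geom_point() +
  scale_shape_manual(name = "Iris species", # legend title
                     values = c("setosa" = 1, # shapes are chosen by number
                                "versicolor" = 2,
                                "virginica" = 4),
                     labels = c("Iris setosa", # legend labels, in the same order as the factor levels
                                "Iris versicolor",
                                "Iris virginica")) -> shapeScatter

shapeScatter
```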

We can also use continuous color scales to visually represent changes in values:

```{r}
ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point(aes(color = Petal.Length)) +
  geom_smooth(method = lm) -> blueCont

blueCont
```
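With no scale specified, ggplot2 falls back on its default gradient for continuous variables. As a sketch of one alternative, `scale_color_viridis_c()` (included in `ggplot2`) swaps in the viridis palette, which was designed to remain readable for people with color blindness. The object name `viridisCont` is made up for this example.

```{r}
library(ggplot2)

ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point(aes(color = Petal.Length)) + # color still follows a continuous variable
  scale_color_viridis_c(name = "Petal length") -> viridisCont # only the palette and legend title change

viridisCont
```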


You can also use other variables within the dataframe to control the aesthetics of the plot.

```{r}
ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point(aes(color = Sepal.Length, # color is dependent on sepal length
                 size = Sepal.Width)) + # point size is dependent on sepal width
  geom_smooth(method = lm,
              color = "black", # change the color of the curve
              se = FALSE) + # hide the confidence interval ribbon
  scale_color_gradient(high = "purple",
                       low = "orange") # manually set the colors of the gradient scale
```


Obviously there is a little too much data now contained in this plot for it to be particularly useful, but it is a good example of how much data you can display and the different ways you can present it using `R`.

Let's tidy up our plot and make something that looks a little more "publication-ready":

First, let's assign our chosen color scale to a vector object so we can call the same colors for any future plots without needing to write out the code every time. This time I'm going to use a color blind friendly palette generated using this tool: <https://davidmathlogic.com/colorblind>.

```{r}
plotColors <- c("setosa" = "#648FFF",
                "versicolor" = "#DC267F",
                "virginica"="#FFB000")
```

```{r}
ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = lm,
              color = "black") +
  theme_bw() + # this is a built-in theme that removes the gray plot background
  scale_color_manual(values = plotColors) + # direct ggplot to our color vector
  ylab("Petal width (cm)") + # change y axis label - can also be done with scales
  labs(x = "Petal length (cm)", # change x axis label
       color = "Iris species") + # change legend title - can also be done with scales as previously
  ggtitle("Petal width by petal length per species") + # add plot title 
  theme(plot.title = element_text(hjust = 0.5)) -> multiColorTidy # center plot title

multiColorTidy
```


Perfect! Now we can write our plot to a pdf:

```{r}
pdf("multi.color.iris.plot.lm.pdf", # file name to write to
    height = 4, # plot height in inches
    width = 6) # plot width in inches

multiColorTidy # tell R which plot to write to file

dev.off() # this tells R that you're done creating a file
```


Or we can use `ggsave()`, a `ggplot2` function that can save plots in many other graphics file types:

```{r}
ggsave(plot = multiColorTidy, # specify plot
       "multi.color.iris.plot.lm.tiff", # specify file name
       height = 4, # plot height
       width = 6, # plot width
       units = c("in"), # specify which units to use for height and width
       device = "tiff") # specify file type for saving - ggsave will also guess depending on the extension used in file name 
```
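As a small sketch of the extension-guessing behavior, the chunk below leaves out `device` and lets `ggsave()` pick the file type from the `.pdf` ending. It writes to R's temporary directory so nothing is left in your project folder. The names `examplePlot` and `exampleFile` are made up for this example.

```{r}
library(ggplot2)

ggplot(data = iris,
       mapping = aes(x = Petal.Length,
                     y = Petal.Width)) +
  geom_point() -> examplePlot

exampleFile <- file.path(tempdir(), "example.iris.plot.pdf") # temporary location for the file

ggsave(filename = exampleFile, # the .pdf extension tells ggsave which device to use
       plot = examplePlot,
       height = 4,
       width = 6,
       units = "in")

file.exists(exampleFile) # check that the file was written
```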


#### Plotting group means and error bars

Another way that we may want to plot our data is by plotting both a group summary (such as the mean or median) *and* the individual data points. This can help people better visualize the spread of our data. This is easy enough with geoms like `geom_boxplot()` and `geom_violin()` that have group summaries built into their functionality.

```{r}
ggplot(data = iris,
       aes(x = Species,
           y = Petal.Length,
           fill = Species)) + 
  geom_boxplot(outliers = FALSE) +
  geom_point()
```


You can see that `geom_boxplot()` has automatically generated a box that spans the 25th to 75th percentiles with a thick line at the median, and whiskers that extend to the furthest value no more than 1.5 × the IQR from the box. Values beyond the whiskers are counted as outliers and would normally be plotted separately, but here we hid them with `outliers = FALSE` because `geom_point()` already draws every observation. This is great for taking a quick summary look at your data.

But what if we want a bar plot (`geom_bar()` or `geom_col()`), which does not calculate group means for us?
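To make those box plot components concrete, here is a rough sketch that calculates the same quantities by hand with `summarise()`. The column names and the `boxStats` object are made up for this example.

```{r}
library(ggplot2)
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise(q25 = quantile(Petal.Length, 0.25), # bottom edge of the box
            med = median(Petal.Length), # thick line inside the box
            q75 = quantile(Petal.Length, 0.75), # top edge of the box
            iqr = IQR(Petal.Length)) %>% # height of the box
  mutate(lower.limit = q25 - 1.5 * iqr, # whiskers cannot extend below this value
         upper.limit = q75 + 1.5 * iqr) -> boxStats # whiskers cannot extend above this value

boxStats
```

The whiskers themselves stop at the most extreme observation that falls inside these limits, not at the limits.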

There are a couple of ways we could solve this issue using functions in the `tidyverse` package. First, we could create a summary table that contains grouped information for our data using `summarise`.

```{r}
iris %>%
  group_by(Species) %>% # group_by tells R which variable to use to group observations
  summarise(mean.Petal.Length = mean(Petal.Length), # add a column containing mean values per species
            standard.deviation = sd(Petal.Length)) -> irisSummary # add a column containing standard deviation

head(irisSummary)
```


We can use the `summarise` function to create a new dataframe that contains a mean and standard deviation for each species. We can write this to a new object and then use this for plotting by providing each `geom` with a different dataframe.

```{r}
ggplot() + # we do not want global mapping or data for this plot so none is put in the ggplot call
  geom_col(data = irisSummary, # set the dataframe for the columns
           aes(x = Species,
               y = mean.Petal.Length,
               fill = Species),
           alpha = 0.5) +
  geom_errorbar(data = irisSummary, # set the dataframe for the error bars
                aes(x = Species,
                    ymin = (mean.Petal.Length - standard.deviation), # set the minimum error bar value
                    ymax = (mean.Petal.Length + standard.deviation)), # set the maximum error bar value
                width = 0.2) +
  geom_jitter(data = iris, # set the dataframe for the points
              aes(x = Species,
                  y = Petal.Length,
                  color = Species),
              width = 0.2, # make the total spread of the points narrower
              shape = 1) # set the shape to open circle
```


Now we can see both the mean and individual values on our bar plot.

Another, more streamlined, way of doing this is using `stat_summary()`, where we remove the need to create a separate dataframe by using functions within the `ggplot2` package. 

```{r}
ggplot(data = iris,
       aes(x = Species,
           y = Petal.Length)) +
  stat_summary(geom = "col", # identify which geom we want
               fun.data = mean_se, # tell stat_summary which function to apply to summarise the data
               aes(fill = Species), # set aesthetics as normal
               alpha = 0.5) +
  stat_summary(geom = "errorbar",
               fun.data = mean_se,
               color = "black",
               width = 0.2) +
  geom_jitter(aes(color = Species),
              shape = 1,
              width = 0.2)
```


Voilà! We have *almost* the same plot as above, but with a step removed. However, you may have noticed that we used the function `mean_se`, which calculates the mean and standard error for the vector of `y` values at each unique `x` value (*i.e.* the function receives a vector of Petal.Length values for each Species), whereas most of the time we like to show the standard deviation. `ggplot2` does not include a ready-made mean ± standard deviation function that works without extra packages (its `mean_sdl()` helper relies on the `Hmisc` package), so what do we do? Create our own.

```{r}
mean.sd <- function(x){
  tibble(y = mean(x), # tell the function that we want a tibble output (similar to dataframe)
         ymin = y - sd(x), # calculates the minimum value for error bar
         ymax = y + sd(x)) # calculates the maximum value for error bar
}
```


Now we can create our plot:

```{r}
ggplot(data = iris,
       aes(x = Species,
           y = Petal.Length)) +
  stat_summary(geom = "col", 
               fun.data = mean.sd, 
               aes(fill = Species),
               alpha = 0.5) +
  stat_summary(geom = "errorbar",
               fun.data = mean.sd,
               color = "black",
               width = 0.2) +
  geom_jitter(aes(color = Species),
              shape = 1,
              width = 0.2)
```
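As a quick check that the two routes agree, the sketch below compares the error bar limits calculated by `stat_summary()` with the mean ± standard deviation from a `summarise()` table. The helper is repeated so the chunk runs on its own, and the `check*` object names are made up for this example.

```{r}
library(ggplot2)
library(dplyr)
library(tibble)

# same helper as defined above, repeated so that this chunk runs on its own
mean.sd <- function(x){
  tibble(y = mean(x),
         ymin = y - sd(x),
         ymax = y + sd(x))
}

# route 1: summary table built with summarise
iris %>%
  group_by(Species) %>%
  summarise(mean.Petal.Length = mean(Petal.Length),
            standard.deviation = sd(Petal.Length)) -> checkSummary

# route 2: values calculated by stat_summary inside the plot
ggplot(data = iris,
       aes(x = Species,
           y = Petal.Length)) +
  stat_summary(geom = "errorbar",
               fun.data = mean.sd) -> checkPlot

layer_data(checkPlot) -> checkStat # pull out the values ggplot2 calculated for the error bars

# compare the lower and upper error bar limits from the two routes
all.equal(checkStat$ymin, checkSummary$mean.Petal.Length - checkSummary$standard.deviation)
all.equal(checkStat$ymax, checkSummary$mean.Petal.Length + checkSummary$standard.deviation)
```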

#### Facets

Faceting is a technique that allows us to separate data out into panels based on a variable in the dataframe. This is useful for visualizing complex data where it may be easier to see patterns when the data are separated.

There are two methods to create facets in a plot: `facet_wrap()` and `facet_grid()`. If you are only creating facets based on one variable (e.g. species), `facet_wrap()` is usually the simplest choice. If you have a more complex plot where you want facets based on two variables (e.g. species *and* time point) laid out as a grid of rows and columns, use `facet_grid()`.

Let's pull up another of `R`'s built-in datasets (mtcars) that will allow us to see both of these in action. mtcars is built from data extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973--74 models).

```{r}
head(mtcars)
```


Let's look at mpg (Miles per US Gallon) plotted against hp (Gross horsepower).

```{r}
ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3)
```


Now let's use `facet_wrap()` to split these data up by vs (Engine shape, 0 = V, 1 = straight).

```{r}
ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3) +
  facet_wrap(~ vs)
```


Let's add another variable facet with `facet_grid()` and split the data by am (Transmission, 0 = automatic, 1 = manual) as well.

```{r}
ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3) +
  facet_grid(cols = vars(vs), # assign a variable to the column panels
             rows = vars(am)) # assign a variable to the row panels
```
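You will often see the same grid written with a formula in online examples. As a sketch, `facet_grid()` also accepts `rows ~ columns`, so the chunk below should produce the same layout as the `vars()` version. The object name `facetFormula` is made up for this example.

```{r}
library(ggplot2)

ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3) +
  facet_grid(am ~ vs) -> facetFormula # formula form: row variable ~ column variable

facetFormula
```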


We can see that the relationship between hp and mpg differs depending on the other qualities of the car. However, this plot is now difficult to read because both facet variables are coded as 0 and 1, meaning it's hard to tell what's what. Let's tidy up these plots and add some labels.

Changing the panel labels without changing the underlying data is slightly more complex than changing axis titles, so let's look at how to do that.

```{r}
vsLabs <- c("0" = "V-shaped",
            "1" = "Straight") # create a vector that matches the binary variables to their values

amLabs <- c("0" = "Automatic",
            "1" = "Manual") # do the same for the am variable

ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3) +
  facet_grid(cols = vars(vs),
             rows = vars(am),
             labeller = labeller(.cols = vsLabs, # use the labeller function to assign these labels to the rows and columns of the plot
                                 .rows = amLabs)) -> facetPlot

facetPlot
```
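If you only need a quick fix and don't mind the raw 0/1 codes, a lighter option is the built-in `label_both` labeller, which adds the variable name to each panel title. This is a sketch of an alternative, not a replacement for the custom labels above, and `facetBoth` is a made-up object name.

```{r}
library(ggplot2)

ggplot(data = mtcars,
       aes(x = hp,
           y = mpg,
           color = mpg)) +
  geom_point(size = 3) +
  facet_grid(cols = vars(vs),
             rows = vars(am),
             labeller = label_both) -> facetBoth # panel titles show the variable name and its value

facetBoth
```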


Let's tidy the rest of this plot up and then save it to file.

```{r}
facetPlot +
  scale_color_gradient(name = "Miles per\nUS Gallon", # \n starts a new line in the legend title
                       high = "purple",
                       low = "orange") + # change color of scale
  theme_bw() +
  xlab("Gross horsepower") + # add x axis title
  ylab("Miles per US Gallon") + # add y axis title
  theme(strip.background = element_rect(fill = "white")) -> facetPlotTidy # remove grey background from panel titles

facetPlotTidy
```


Now we can use the same methods as earlier to save our plot to either a PDF or image file (or both!).

```{r}
pdf("multi.facet.mtcars.plot.pdf", # file name to write to
    height = 4, # plot height in inches
    width = 6) # plot width in inches

facetPlotTidy # tell R which plot to write to file

dev.off() # this tells R that you're done creating a file
```

```{r}
ggsave(plot = facetPlotTidy, # specify plot
       "multi.facet.mtcars.plot.tiff", # specify file name
       height = 4, # plot height
       width = 6, # plot width
       units = c("in"), # specify which units to use for height and width
       device = "tiff") # specify file type for saving - ggsave will also guess depending on the extension used in file name
```


#### Plotting a time course

For many experiments, it's important to be able to plot a time course. Let's load in some example colony count data from an experiment growing four species of bacteria in both high and low iron conditions, with time points at 0, 6, and 24 hours.

```{r}
read.csv("cfu_counts_raw.csv") -> counts # read in counts
```


Let's take a quick look at the format of the data we just loaded and check that the format looks correct for plotting.

```{r}
head(counts)
```


Our dataframe has columns as **variables** and rows as **observations** so we're good to go! 

In order to plot a time course with time as a discrete variable that runs along the x axis, we need to change the `time` variable from numeric to a factor. Factors also help us control the order in which observations are plotted. By default, ggplot2 will plot numeric variables in ascending order, character variables in alphabetical order, and factors in the order of their levels (which is alphabetical unless we set it ourselves). So, we'll also set the iron level as a factor with explicit levels, because I want to plot the low iron condition *before* the high iron condition.

```{r}
counts$time <- factor(counts$time)

counts$iron <- factor(counts$iron,
                      levels = c("Low iron","High iron"))
```
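To see how level order is decided, here is a short sketch using toy vectors. These are made up for illustration and are not the real CFU data.

```{r}
# toy vectors made up for illustration (not the real CFU data)
ironExample <- c("Low iron", "High iron", "Low iron")
timeExample <- c("0", "6", "24")

levels(factor(ironExample)) # default order is alphabetical, so "High iron" comes first

levels(factor(ironExample,
              levels = c("Low iron", "High iron"))) # explicit levels put "Low iron" first

levels(factor(timeExample)) # times stored as text sort as text, so "24" lands before "6"

levels(factor(as.numeric(timeExample))) # times stored as numbers sort by value
```

This is why it matters that `time` is numeric before it is converted to a factor: the levels then follow the order of the time points.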


Now we can set our custom colors for the plot.

```{r}
speciesCols <- c("Pseudomonas aeruginosa" = "#43ba8f",
                 "Staphylococcus aureus" = "#fec44f",
                 "Streptococcus sanguinis" = "#4292c6",
                 "Burkholderia orbicola" = "#d57bd4")
```


Let's create a line plot of log10 CFU/mL per species over time, with facets showing the low and high iron conditions. We will plot a ribbon that represents the mean ± standard deviation, thin lines that represent each replicate (`tech.rep`), and a thick line representing the mean log10 CFU/mL for each species. We'll use the `stat_summary()` function that we saw earlier, together with our `mean.sd` function, to calculate the means and standard deviations.

```{r}
counts %>%
  mutate(log10.cfu = log10(cfu)) %>% # create a new column with log10 CFU values
  ggplot(aes(x = time,
             y = log10.cfu)) +
  stat_summary(geom = "ribbon",
               fun.data = mean.sd,
               aes(group = species, # group aesthetic specifies how the lines are joined together
                   fill = species),
               alpha = 0.5) +
  geom_line(aes(group = interaction(species,iron,tech.rep),
                color = species),
            linewidth = 0.1,
            alpha = 0.7) +
  stat_summary(geom = "line",
               fun.data = mean.sd,
               aes(color = species,
                   group = interaction(species,iron)),
               linewidth = 1) +
  scale_color_manual(name = "Species", # both color and fill must have the same name if we want to combine the legend
                     values = speciesCols) +
  scale_fill_manual(name = "Species",
                    values = speciesCols) +
  labs(x = "Time (h)", # set axis labels
       y = "Log<sub>10</sub> CFU/mL") +
  theme_bw() + # remove grey background
  facet_grid(cols = vars(iron)) + # facet plot by high/low iron
  theme(strip.background = element_rect(fill = "white", # remove grey background from facet titles
                                        color = "black"),
        legend.text = element_text(face = "italic"), # set legend font to italic
        legend.position = "inside", # move legend inside bounds of plot
        legend.position.inside = c(0.8,0.2), # use a vector to set x and y position (0 - 1)
        legend.background = element_rect(fill = "white", # set box around legend
                                         color = "black"),
        axis.title.y = element_markdown()) # allow axis title to read html code for subscript
```


### Activities

#### Green 1

Create a scatter plot using the columns Sepal.Length (x) and Sepal.Width (y) from the iris dataset.

```{r}
ggplot(data = iris, 
       aes(x = Sepal.Length, 
           y = Sepal.Width)) + 
  geom_point(aes(color = Species)) +
  geom_smooth(method = lm)
```


#### Green 2

Make a plot where all the points are green, and the line is colored by the species of iris.

```{r}
ggplot(data = iris,
       aes(x = Petal.Length,
           y = Petal.Width)) +
  geom_point(color = "green") +
  geom_smooth(aes(color = Species),
              method = lm)
```


#### Blue 1

Make a plot that includes regression lines for individual species as well as the overall data.

```{r}
ggplot(data = iris, 
       aes(x = Petal.Length, 
           y = Petal.Width)) +
  geom_point(aes(color = Species)) +
  geom_smooth(aes(color = Species), 
              method = lm) +
  geom_smooth(method = lm, 
              color="blue")
```